[BugFix] Fix Whitelist optimization CI failure by xiaohajiayou · Pull Request #3290 · vllm-project/vllm-omni

xiaohajiayou · 2026-05-01T09:37:58Z

Purpose

Reapply the deploy override field derivation change that was reverted by #3287, and explicitly restore the previous deploy behavior for prefix caching.

The previous attempt allowed omitted deploy fields to fall through to vLLM defaults. For enable_prefix_caching, that changed behavior from Omni's previous default False to vLLM's model-dependent fallback. For decoder/generative stages, vLLM usually resolves this to True, which exposed unsupported Omni multi-stage prefix-cache paths and caused L3/L4 CI failures.

This PR keeps the config refactor, but makes the old behavior explicit by setting enable_prefix_caching: false on all deploy stages.

Changes

Reapply #3162 ([Config Refactor] Derive deploy override fields from stage config) by reverting the revert commit #3287 ([CI failed] Revert "[Config Refactor] Derive deploy override fields from stage config").
Add explicit enable_prefix_caching: false to every stage in vllm_omni/deploy/*.yaml.
Preserve the schema/default refactor while avoiding accidental vLLM fallback behavior for prefix cache.
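
The fallthrough described above can be illustrated with a minimal sketch. It is not the vllm-omni implementation: `resolve_engine_args` and `ENGINE_DEFAULTS` are hypothetical names, and the engine default of `True` is an assumption standing in for vLLM's model-dependent resolution on a decoder/generative stage.

```python
from typing import Any

# Hypothetical stand-in for what the engine resolves when a field is omitted.
# Assumption: a decoder/generative stage, where prefix caching resolves to True.
ENGINE_DEFAULTS: dict[str, Any] = {"enable_prefix_caching": True}


def resolve_engine_args(stage_overrides: dict[str, Any]) -> dict[str, Any]:
    """Merge stage overrides onto engine defaults, skipping unset (None) fields."""
    resolved = dict(ENGINE_DEFAULTS)
    for key, value in stage_overrides.items():
        if value is not None:  # None means "not set in the deploy YAML"
            resolved[key] = value
    return resolved


# Omitted field: falls through to the engine default.
omitted = resolve_engine_args({"enable_prefix_caching": None})
# Explicit false in the deploy YAML: the previous Omni behavior is preserved.
explicit = resolve_engine_args({"enable_prefix_caching": False})

print(omitted["enable_prefix_caching"])   # True
print(explicit["enable_prefix_caching"])  # False
```

The point of the sketch is that `False` and `None` must be treated differently: only `None` means "unset", so an explicit `false` in the YAML survives the merge.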

Reapply the deploy override field derivation that was reverted in vllm-project#3287 and make prefix-cache behavior explicit in deploy configs. This preserves the config refactor while restoring the previous Omni behavior where deploy stages do not accidentally fall through to vLLM's model-dependent prefix-cache default. Signed-off-by: xiaohajiayou <923390377@qq.com>

xiaohajiayou · 2026-05-01T09:42:18Z

Could you help run the full CI checks to make sure there are no other issues? @lishunyang12 @Gaohan123 @hsliuustc0106

xiaohajiayou · 2026-05-02T14:04:00Z

Updated according to your comments:

skipped the known MiMo CI failure case for now
reverted the unnecessary enforce_eager refactor
grouped comments for the StageDeployConfig parameters

Could you please take another look when you have time and let me know if there are any remaining issues?

hsliuustc0106

Review Summary

What I validated:

DCO, pre-commit, mergeability all passing
Fresh unit tests cover the core config nullification behavior (test_default_stage_config_ignores_none_deploy_overrides, test_to_omegaconf_omits_none_deploy_overrides_for_engine_args, test_deploy_override_fields_include_deploy_schema_fields)
deploy_override_field_names() unification into stage_config.py is clean and removes the duplicated allowlist from arg_utils.py
enable_prefix_caching: false is now explicit on all deploy YAML stages — this is the right fix for the CI regression

What must change before approval:

cosyvoice3.yaml: disable_hybrid_kv_cache_manager: true was silently removed and replaced with enable_prefix_caching: false. These are different settings (hybrid KV cache manager vs prefix caching). If this was intentional, please explain why disabling the hybrid KV cache manager is no longer needed for cosyvoice3. If unintentional, restore it alongside the new enable_prefix_caching line.
Removal of engine_args.setdefault("max_num_seqs", 1) without ensuring all deploy YAMLs set max_num_seqs. Deploy configs like qwen3_omni_moe.yaml (all 3 stages) and cosyvoice3.yaml (both stages) don't set max_num_seqs. With the old setdefault, they got max_num_seqs=1. Now they will fall through to vLLM EngineArgs default of 256. This could change scheduling behavior and memory allocation. Either restore the setdefault or add explicit max_num_seqs values to every stage in every deploy YAML that currently omits it.
Missing test evidence. The PR body's test plan and test results sections are empty. This is a >10 file change that previously caused CI failures (#3287). Please run L3 tests locally and paste the results.
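
One way to act on the `max_num_seqs` point is a small check that lists every stage that does not set a given field. The sketch below works on already-parsed configs (plain dicts); `stages_missing_field` is a hypothetical helper and the sample data is invented for illustration, not copied from the repository's deploy YAMLs.

```python
from typing import Any


def stages_missing_field(
    deploy_configs: dict[str, list[dict[str, Any]]], field: str
) -> list[tuple[str, int]]:
    """Return (config name, stage index) pairs whose stage does not set `field`."""
    missing = []
    for name, stages in sorted(deploy_configs.items()):
        for index, stage in enumerate(stages):
            if stage.get(field) is None:
                missing.append((name, index))
    return missing


# Illustrative data only: the stage contents are invented.
configs = {
    "example_a.yaml": [{"max_num_seqs": 1}, {"enable_prefix_caching": False}],
    "example_b.yaml": [{"max_num_seqs": 1}],
}
print(stages_missing_field(configs, "max_num_seqs"))  # [('example_a.yaml', 1)]
```

Any stage reported by such a check would previously have received `max_num_seqs=1` from the removed `setdefault` and would now fall through to the engine default.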

Non-blocking:

buildkite CI is still pending — wait for it to complete before merging

Reviewed by Claude Code

xiaohajiayou · 2026-05-02T16:21:15Z

Previously, the fix added several missing fields in deploy_config that were exposed during model migration, and set defaults in deploy_config to None, treating the vLLM config as the single source of truth.

However, during CI fixes, two different issues got mixed together:

1. For defaults removed from deploy_config: if a field was not explicitly set in the original YAML, it now needs to be added back explicitly so that behavior stays the same before and after this PR.
2. Some migrated YAMLs are missing fields and implicitly rely on defaults. In addition, some of these fields may not need to be user-configurable and could be handled in the pipeline (e.g., pipeline.py).

To keep this PR focused and easier to reason about, this PR only addresses the first issue (i.e., preserving previous behavior for migrated YAMLs).
The second issue will be handled in a follow-up, where we can more systematically clean up and define intended configs.

Based on this, this PR updates the migrated deploy YAMLs to explicitly restore only those defaults that differ between old Omni behavior and vLLM defaults, as summarized below:

| Field | Old Omni default | vLLM default / final default | Action |
| --- | --- | --- | --- |
| `gpu_memory_utilization` | `0.9` | `0.9` | no explicit override needed |
| `tensor_parallel_size` | `1` | `1` | no explicit override needed |
| `enforce_eager` | `False` | `False` | no explicit override needed |
| `data_parallel_size` | `1` | `1` | no explicit override needed |
| `pipeline_parallel_size` | `1` | `1` | no explicit override needed |
| `trust_remote_code` | `True` | `False` | explicitly preserved per stage where needed |
| `enable_prefix_caching` | `False` | `None` (often resolves to `True`) | explicitly preserved where needed |
| `max_num_batched_tokens` | `32768` | inferred by vLLM (e.g., `2048` / `8192` / `16384`) | explicitly preserved where needed |
| `max_num_seqs` | `1` | often `256` / `1024` | explicitly preserved where needed |

The goal here is conservative compatibility: keep migrated deploy YAML behavior aligned with pre-refactor Omni defaults, instead of silently falling through to vLLM defaults.
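
The selection rule behind the table can be written down as a short sketch: a field needs an explicit value in the YAML only when the old Omni default differs from what vLLM would use. The values are transcribed from the table above; using `None` for vLLM defaults that are resolved at runtime is a simplification made for this sketch, and `fields_needing_explicit_value` is a hypothetical helper.

```python
from typing import Any

# Values transcribed from the table. None marks a vLLM default that is resolved
# at runtime (model- or context-dependent) rather than a fixed value.
OLD_OMNI_DEFAULTS: dict[str, Any] = {
    "gpu_memory_utilization": 0.9,
    "tensor_parallel_size": 1,
    "enforce_eager": False,
    "data_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "trust_remote_code": True,
    "enable_prefix_caching": False,
    "max_num_batched_tokens": 32768,
    "max_num_seqs": 1,
}
VLLM_DEFAULTS: dict[str, Any] = {
    "gpu_memory_utilization": 0.9,
    "tensor_parallel_size": 1,
    "enforce_eager": False,
    "data_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "trust_remote_code": False,
    "enable_prefix_caching": None,
    "max_num_batched_tokens": None,
    "max_num_seqs": None,
}


def fields_needing_explicit_value(old: dict[str, Any], new: dict[str, Any]) -> list[str]:
    """List fields whose old default would change if left to the new defaults."""
    return [name for name, value in old.items() if new.get(name) != value]


print(fields_needing_explicit_value(OLD_OMNI_DEFAULTS, VLLM_DEFAULTS))
# ['trust_remote_code', 'enable_prefix_caching', 'max_num_batched_tokens', 'max_num_seqs']
```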

Follow-up todo issue:
#3313

hsliuustc0106 · 2026-05-02T16:59:17Z

+    mm_processor_cache_gb: float | None = None
+
+    # Profiling, tokenizer/config parsing, and model-loading behavior.
+    profiler_config: dict[str, Any] | None = None


@bjf-frz @david6666666 PTAL

… own group Move devices and tensor_parallel_size into a dedicated "GPU resources and parallelism" section, leaving stage_id alone as stage identity. Change devices default from "0" to None, and tighten the None check in merge_pipeline_deploy to avoid writing a spurious "devices" key. Signed-off-by: xiaohajiayou <923390377@qq.com>

hsliuustc0106 · 2026-05-03T02:45:08Z

update for ming as well after #3154 merged

xiaohajiayou · 2026-05-04T14:13:31Z

It seems most of the discussion is now around whether these defaults should be fully removed from DeployConfig and instead be explicitly defined in YAMLs.
Previously, there were two main considerations:

Some fields (e.g., trust_remote_code) are not user-configurable, so we considered handling them in pipeline.py.

In [Refactor] Remove redundant StageDeployConfig fields, delegate to vLLM defaults #3128, we also explored removing vLLM-related fields maintained in StageDeployConfig.

Based on these considerations, in this PR I removed the implicit reliance on defaults and explicitly materialized them in the YAMLs.

My plan is to continue the discussion in a follow-up issue (#3313). This way, even if we later remove vLLM fields from StageDeployConfig, we won’t run into issues caused by implicit default drift.

xiaohajiayou · 2026-05-05T14:35:04Z

Known good commit: 9d9b720. CI was still passing there.

After that, this PR had only one change of its own: ad83e5f, which is mainly YAML field ordering/schema cleanup and does not touch Qwen3-TTS online serving, runtime helpers, or the assertion logic. The rest of the changes were brought in by updating/merging main.

Between 9d9b7209 and the current failing state, the Qwen3-TTS related changes seem to be from main, not this PR:

5fc0bfe0 / PR [Cleanup] Use tokens_input() for TTS prompt construction #3227: changed TTS prompt construction in serving_speech.py; this looks most suspicious.
c007d40b / PR [NPU] Upgrade to v0.20.0 & align with GPU model runner #3325: changed qwen3_tts_talker.py and added an NPU override in qwen3_tts.yaml.
bb239fa9 / PR [Core] Support Async & Sync AutoRegressive Scheduling #3306: changed AR scheduler logic.

I also checked the CI results for those individual PRs, and they all passed, which makes this a bit strange.

amy-why-3459 · 2026-05-05T14:40:18Z

If possible, could you add an omni-test label to check if the changes in this PR have any impact on performance? @gcanlin @lishunyang12

xiaohajiayou · 2026-05-05T15:37:12Z

I ran this Qwen3-TTS CI test locally and it passes on my side. Here are the logs:

cmd

cd /root/vllm-omni
MODEL_PREFIX=/root/models /root/vllm-omni/.venv/bin/python -m pytest \
  tests/e2e/online_serving/test_qwen3_tts_base.py::test_text_to_audio_001[async_chunk] \
  -m advanced_model -s --run-level advanced_model

result

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
--- Running Summary
============================================ 1 passed, 18 warnings in 392.15s (0:06:32) =============================================

Full logs

(vllm-omni) root@autodl-container-xs2vhvepls-41228bbf:~/vllm-omni# cd /root/vllm-omni
MODEL_PREFIX=/root/models /root/vllm-omni/.venv/bin/python -m pytest \
  tests/e2e/online_serving/test_qwen3_tts_base.py::test_text_to_audio_001[async_chunk] \
  -m advanced_model -s --run-level advanced_model
======================================================== test session starts ========================================================
platform linux -- Python 3.12.3, pytest-9.0.3, pluggy-1.6.0
rootdir: /root/vllm-omni
configfile: pyproject.toml
plugins: anyio-4.13.0, mock-3.15.1
collecting ... INFO 05-05 23:24:00 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
WARNING 05-05 23:24:00 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:24:00 [nixl_utils.py:44] NIXL agent config is not available
collected 1 item                                                                                                                    

tests/e2e/online_serving/test_qwen3_tts_base.py INFO 05-05 23:24:00 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 05-05 23:24:00 [vllm.py:840] Asynchronous scheduling is enabled.
INFO 05-05 23:24:00 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
Path load_format does not exist
Path load_format does not exist
Pre-test GPU status:
[GPU Memory Monitor] Waiting for GPU 0 to free memory, Condition: Memory usage ratio ≤ 5.0%
[GPU Memory Status] Current usage:
  GPU 0: 0.5GiB/32.0GiB (1.6%)
[GPU Memory Freed] Devices 0 meet memory condition
   Condition: Memory usage ratio ≤ 5.0%
   Wait time: 0.0 seconds (0.0 minutes)
Post-test GPU status:

================================================================================
NVIDIA GPU Information (nvidia-smi)
================================================================================
Tue May  5 23:24:00 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        On  |   00000000:4F:00.0 Off |                  N/A |
| 30%   32C    P8             13W /  320W |       1MiB /  32760MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

================================================================================
Detailed GPU Processes (nvidia-smi pmon)
================================================================================
# gpu         pid   type     sm    mem    enc    dec    jpg    ofa    command 
# Idx           #    C/G      %      %      %      %      %      %    name 
    0          -     -      -      -      -      -      -      -    -              


================================================================================
System Processes with GPU keywords
================================================================================
Launching OmniServer with: /root/vllm-omni/.venv/bin/python -m vllm_omni.entrypoints.cli.main serve /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base --host 127.0.0.1 --port 34085 --omni --trust-remote-code --disable-log-stats --stage-init-timeout 600 --init-timeout 900 --stage-configs-path /tmp/qwen3_tts_op0ng0ik.yaml
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:24:10 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:24:10 [nixl_utils.py:44] NIXL agent config is not available
INFO 05-05 23:24:12 [logo.py:45]        █     █     █▄   ▄█       ▄▀▀▀▀▄ █▄   ▄█ █▄    █ ▀█▀ 
INFO 05-05 23:24:12 [logo.py:45]  ▄▄ ▄█ █     █     █ ▀▄▀ █  ▄▄▄  █    █ █ ▀▄▀ █ █ ▀▄  █  █  
INFO 05-05 23:24:12 [logo.py:45]   █▄█▀ █     █     █     █       █    █ █     █ █   ▀▄█  █  
INFO 05-05 23:24:12 [logo.py:45]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀        ▀▀▀▀  ▀     ▀ ▀     ▀ ▀▀▀ 
INFO 05-05 23:24:12 [logo.py:45] 
(APIServer pid=9684) INFO 05-05 23:24:12 [utils.py:299] vLLM server version 0.20.0, serving model /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base
(APIServer pid=9684) INFO 05-05 23:24:13 [utils.py:233] non-default args: {'model_tag': '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base', 'host': '127.0.0.1', 'port': 34085, 'model': '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base', 'tokenizer_mode': None, 'trust_remote_code': True, 'dtype': None, 'enforce_eager': None, 'config_format': None, 'load_format': None, 'pipeline_parallel_size': None, 'tensor_parallel_size': None, 'data_parallel_size': None, 'gpu_memory_utilization': None, 'mm_processor_cache_gb': None, 'skip_mm_profiling': None, 'compilation_config': None, 'profiler_config': None, 'disable_log_stats': True}
(APIServer pid=9684) INFO 05-05 23:24:13 [omni_base.py:153] [AsyncOmni] Initializing with model /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base
(APIServer pid=9684) INFO 05-05 23:24:13 [async_omni_engine.py:290] [AsyncOmniEngine] Initializing with model /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base
(APIServer pid=9684) WARNING 05-05 23:24:13 [async_omni_engine.py:1418] stage_configs_path is set — the following top-level engine args are ignored (per-stage YAML takes precedence): attention_config, disable_log_stats, eplb_config, ir_op_priority, kernel_config, reasoning_parser_plugin, structured_outputs_config, trust_remote_code
(APIServer pid=9684) WARNING 05-05 23:24:13 [utils.py:191] Filtered out 1 callable object(s) from base_engine_args that are not compatible with OmegaConf: ['dispatch_function']. 
(APIServer pid=9684) INFO 05-05 23:24:13 [async_omni_engine.py:350] [AsyncOmniEngine] Launching Orchestrator thread with 2 stages
(APIServer pid=9684) INFO 05-05 23:24:13 [initialization.py:351] Loaded OmniTransferConfig with 1 connector configurations
(APIServer pid=9684) INFO 05-05 23:24:13 [async_omni_engine.py:767] [AsyncOmniEngine] Initializing stage 0
(APIServer pid=9684) INFO 05-05 23:24:13 [stage_init_utils.py:386] [stage_init] Stage-0 set runtime devices: 0
(APIServer pid=9684) INFO 05-05 23:24:13 [async_omni_engine.py:767] [AsyncOmniEngine] Initializing stage 1
(APIServer pid=9684) WARNING 05-05 23:24:13 [envs.py:1818] Unknown vLLM environment variable detected: VLLM_TEST_CLEAN_GPU_MEMORY
(APIServer pid=9684) INFO 05-05 23:24:13 [configuration_qwen3_tts.py:487] talker_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:24:13 [configuration_qwen3_tts.py:490] speaker_encoder_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:24:13 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) INFO 05-05 23:24:13 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) INFO 05-05 23:24:23 [model.py:555] Resolved architecture: Qwen3TTSTalkerForConditionalGeneration
(APIServer pid=9684) INFO 05-05 23:24:23 [model.py:1680] Using max model len 4096
(APIServer pid=9684) INFO 05-05 23:24:23 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=512.
(APIServer pid=9684) INFO 05-05 23:24:23 [vllm.py:840] Asynchronous scheduling is enabled.
(APIServer pid=9684) INFO 05-05 23:24:23 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=9684) INFO 05-05 23:24:23 [async_omni_engine.py:467] [AsyncOmniEngine] Stage 0 engine launch started
(APIServer pid=9684) INFO 05-05 23:24:23 [stage_init_utils.py:386] [stage_init] Stage-1 set runtime devices: 0
(APIServer pid=9684) WARNING 05-05 23:24:23 [envs.py:1818] Unknown vLLM environment variable detected: VLLM_TEST_CLEAN_GPU_MEMORY
(APIServer pid=9684) INFO 05-05 23:24:23 [configuration_qwen3_tts.py:487] talker_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:24:23 [configuration_qwen3_tts.py:490] speaker_encoder_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:24:23 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) INFO 05-05 23:24:23 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:24:32 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:24:32 [nixl_utils.py:44] NIXL agent config is not available
(APIServer pid=9684) INFO 05-05 23:24:34 [model.py:555] Resolved architecture: Qwen3TTSCode2Wav
(APIServer pid=9684) INFO 05-05 23:24:34 [model.py:1680] Using max model len 65536
(APIServer pid=9684) INFO 05-05 23:24:34 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=65536.
(APIServer pid=9684) WARNING 05-05 23:24:34 [vllm.py:896] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=9684) WARNING 05-05 23:24:34 [vllm.py:914] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=9684) INFO 05-05 23:24:34 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'])
(APIServer pid=9684) INFO 05-05 23:24:34 [vllm.py:1089] Cudagraph is disabled under eager mode
(APIServer pid=9684) INFO 05-05 23:24:34 [compilation.py:303] Enabled custom fusions: norm_quant, act_quant
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:34 [core.py:109] Initializing a V1 LLM engine (v0.20.0) with config: model='/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base', speculative_config=None, tokenizer='/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 
'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [512], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:34 [parallel_state.py:1402] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.4:37609 backend=nccl
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:34 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(StageEngineCoreProc pid=9834) WARNING 05-05 23:24:35 [base.py:188] [LLM Worker 0] Sleep Mode DISABLED.
(StageEngineCoreProc pid=9834) WARNING 05-05 23:24:35 [base.py:188] [LLM Worker 0] Sleep Mode DISABLED.
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:35 [gpu_model_runner.py:4777] Starting to load model /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base...
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:36 [cuda.py:368] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:36 [flash_attn.py:646] Using FlashAttention version 2
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:36 [voice_cache.py:43] Voice embedding cache initialized (max_entries=128)
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:36 [weight_utils.py:904] Filesystem type for checkpoints: OVERLAY. Checkpoint size: 3.59 GiB. Available RAM: 450.39 GiB.
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:36 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (OVERLAY) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.41it/s]
(StageEngineCoreProc pid=9834) 
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:37 [qwen3_tts_talker.py:1656] Loaded 396 weights for Qwen3TTSTalkerForConditionalGeneration
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:37 [default_loader.py:384] Loading weights took 0.90 seconds
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:38 [gpu_model_runner.py:4879] Model loading took 3.63 GiB memory and 1.898549 seconds
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:38 [configuration_qwen3_tts.py:487] talker_config is None. Initializing talker model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:38 [configuration_qwen3_tts.py:490] speaker_encoder_config is None. Initializing talker model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:38 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:38 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:38 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [configuration_qwen3_tts.py:487] talker_config is None. Initializing talker model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [configuration_qwen3_tts.py:490] speaker_encoder_config is None. Initializing talker model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [backends.py:1069] Using cache directory: /root/.cache/vllm/torch_compile_cache/5edd4a18a8/rank_0_0/backbone for vLLM's torch.compile
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [backends.py:1128] Dynamo bytecode transform time: 4.06 s
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:44 [backends.py:290] Directly load the compiled graph(s) for compile range (1, 512) from the cache, took 1.249 s
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:44 [decorators.py:305] Directly load AOT compilation from path /root/.cache/vllm/torch_compile_cache/torch_aot_compile/3e289124d765cb14db476cf476e22bf6f15ab2acf33192b13414f2f7efd02f88/rank_0_0/model
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:44 [monitor.py:53] torch.compile took 6.05 s in total
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:44 [monitor.py:81] Initial profiling/warmup run took 0.17 s
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:45 [base.py:163] Available KV cache memory: 5.72 GiB (profiling fallback)
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:45 [kv_cache_utils.py:1711] GPU KV cache size: 53,504 tokens
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:45 [kv_cache_utils.py:1716] Maximum concurrency for 4,096 tokens per request: 13.06x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████| 5/5 [00:00<00:00, 21.85it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 26.35it/s]
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:46 [gpu_model_runner.py:6133] Graph capturing finished in 1 secs, took 0.06 GiB
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:46 [core.py:299] init engine (profile, create kv cache, warmup model) took 8.40 s (compilation: 6.05 s)
(StageEngineCoreProc pid=9834) The tokenizer you are loading from '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(StageEngineCoreProc pid=9834) WARNING 05-05 23:24:47 [scheduler.py:181] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARAsyncScheduler. This scheduler interface is not public and compatibility may not be maintained.
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:47 [factory.py:46] Created connector: SharedMemoryConnector
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:47 [vllm.py:840] Asynchronous scheduling is enabled.
(APIServer pid=9684) INFO 05-05 23:24:47 [async_omni_engine.py:484] [AsyncOmniEngine] Stage 0 engine startup completed
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:47 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=9684) INFO 05-05 23:24:47 [async_omni_engine.py:467] [AsyncOmniEngine] Stage 1 engine launch started
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:24:56 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:24:56 [nixl_utils.py:44] NIXL agent config is not available
(StageEngineCoreProc pid=10102) INFO 05-05 23:24:58 [core.py:109] Initializing a V1 LLM engine (v0.20.0) with config: model='/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base', speculative_config=None, tokenizer='/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [65536], 'inductor_compile_config': 
{'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=True, moe_backend='auto')
(StageEngineCoreProc pid=10102) INFO 05-05 23:24:58 [parallel_state.py:1402] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.4:57047 backend=nccl
(StageEngineCoreProc pid=10102) INFO 05-05 23:24:58 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(StageEngineCoreProc pid=10102) WARNING 05-05 23:24:59 [base.py:188] [LLM Worker 0] Sleep Mode DISABLED.
(StageEngineCoreProc pid=10102) WARNING 05-05 23:24:59 [base.py:188] [LLM Worker 0] Sleep Mode DISABLED.
(StageEngineCoreProc pid=10102) INFO 05-05 23:24:59 [gpu_model_runner.py:4777] Starting to load model /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base...
(StageEngineCoreProc pid=10102) INFO 05-05 23:24:59 [default_loader.py:384] Loading weights took 1749513.05 seconds
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:00 [gpu_model_runner.py:4879] Model loading took 0.0 GiB memory and 0.002176 seconds
(StageEngineCoreProc pid=10102) `torch_dtype` is deprecated! Use `dtype` instead!
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:00 [configuration_qwen3_tts_tokenizer_v2.py:156] encoder_config is None. Initializing encoder with default values
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:00 [configuration_qwen3_tts_tokenizer_v2.py:159] decoder_config is None. Initializing decoder with default values
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:00 [modeling_qwen3_tts_tokenizer_v2.py:969] Precomputed exp caches for 29 SnakeBeta activations
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:00 [cuda_graph_decoder_wrapper.py:105] Starting CUDA Graph warmup for 11 sizes: [2, 4, 8, 16, 25, 32, 64, 97, 128, 256, 325]
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:02 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=2
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:02 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=4
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:02 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=8
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:02 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=16
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=25
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=32
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=64
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=97
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=128
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=256
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=325
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:123] CUDA Graph warmup complete: 11/11 captured
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [modeling_qwen3_tts_tokenizer_v2.py:999] CUDA Graph enabled for decoder: seq_lens=[2, 4, 8, 16, 25, 32, 64, 97, 128, 256, 325]
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [qwen3_tts_code2wav.py:158] Code2Wav decoder CUDA Graph enabled
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:03 [qwen3_tts_code2wav.py:260] Code2Wav input_ids length 1 not divisible by num_quantizers 16; skipping malformed request.
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:03 [gpu_generation_model_runner.py:472] Dummy sampler run is not implemented for generation model
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [core.py:306] init engine (profile, create kv cache, warmup model) took 3.34 s
(StageEngineCoreProc pid=10102) The tokenizer you are loading from '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:04 [scheduler.py:181] Using custom scheduler class vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler. This scheduler interface is not public and compatibility may not be maintained.
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:04 [core.py:138] Disabling chunked prefill for model without KVCache
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:04 [factory.py:46] Created connector: SharedMemoryConnector
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:04 [vllm.py:840] Asynchronous scheduling is enabled.
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:04 [vllm.py:896] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:04 [vllm.py:914] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:04 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'])
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:04 [vllm.py:1089] Cudagraph is disabled under eager mode
(APIServer pid=9684) INFO 05-05 23:25:04 [async_omni_engine.py:484] [AsyncOmniEngine] Stage 1 engine startup completed
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:04 [compilation.py:303] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=9684) INFO 05-05 23:25:04 [stage_engine_core_client.py:134] [StageEngineCoreClient] Stage-0 initializing EngineCore
(APIServer pid=9684) INFO 05-05 23:25:04 [stage_engine_core_client.py:134] [StageEngineCoreClient] Stage-1 initializing EngineCore
(APIServer pid=9684) INFO 05-05 23:25:04 [stage_engine_core_client.py:174] [StageEngineCoreClient] Stage-1 EngineCore running
(APIServer pid=9684) INFO 05-05 23:25:04 [configuration_qwen3_tts.py:487] talker_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:25:04 [configuration_qwen3_tts.py:490] speaker_encoder_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:25:04 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) INFO 05-05 23:25:04 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) INFO 05-05 23:25:04 [stage_engine_core_client.py:174] [StageEngineCoreClient] Stage-0 EngineCore running
(APIServer pid=9684) INFO 05-05 23:25:05 [configuration_qwen3_tts.py:487] talker_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:25:05 [configuration_qwen3_tts.py:490] speaker_encoder_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:25:05 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) The tokenizer you are loading from '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=9684) INFO 05-05 23:25:05 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) INFO 05-05 23:25:05 [async_omni_engine.py:702] [AsyncOmniEngine] Stage 1 initialized
(APIServer pid=9684) The tokenizer you are loading from '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=9684) INFO 05-05 23:25:05 [async_omni_engine.py:702] [AsyncOmniEngine] Stage 0 initialized
(APIServer pid=9684) INFO 05-05 23:25:05 [orchestrator.py:192] [Orchestrator] Starting event loop
(APIServer pid=9684) INFO 05-05 23:25:05 [async_omni_engine.py:378] [AsyncOmniEngine] Orchestrator ready with 2 stages
(APIServer pid=9684) INFO 05-05 23:25:05 [omni_base.py:166] [AsyncOmni] AsyncOmniEngine initialized in 52.99 seconds
(APIServer pid=9684) INFO 05-05 23:25:05 [omni_base.py:185] [AsyncOmni] Initialized with 2 stages for model /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base
(APIServer pid=9684) INFO 05-05 23:25:06 [api_server.py:651] Supported tasks: {'generate', 'speech'}
(APIServer pid=9684) WARNING 05-05 23:25:06 [model.py:1437] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'repetition_penalty': 1.05, 'temperature': 0.9, 'max_tokens': 8192}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=9684) The tokenizer you are loading from '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=9684) INFO 05-05 23:25:06 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=9684) WARNING 05-05 23:25:06 [serving_speech.py:401] No speakers found in config (checked spk_id and speaker_id)
(APIServer pid=9684) WARNING 05-05 23:25:06 [serving_speech.py:234] Uploaded voices are ephemeral and will be lost on server restart. Re-upload voices after each restart if needed.
(APIServer pid=9684) INFO 05-05 23:25:06 [serving_speech.py:242] Loaded 0 supported speakers: []
(APIServer pid=9684) INFO 05-05 23:25:06 [serving_speech.py:293] Loaded codec frame rate: 12.5 Hz (output_sample_rate=24000, encode_downsample_rate=1920)
(APIServer pid=9684) INFO 05-05 23:25:06 [serving.py:45] OpenAIServingRealtime initialized for task: realtime
(APIServer pid=9684) INFO 05-05 23:25:06 [api_server.py:424] Starting vLLM API server 0 on http://127.0.0.1:34085
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:37] Available routes are:
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/audio/speech, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/audio/speech/batch, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/audio/generate, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/audio/voices, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/audio/voices, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/audio/voices/{name}, Methods: DELETE
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/images/generations, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/images/edits, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/videos, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/videos/sync, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/videos, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/videos/{video_id}, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/videos/{video_id}, Methods: DELETE
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/videos/{video_id}/content, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/omni/sleep, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/omni/wakeup, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:57] Route: /v1/audio/speech/stream, Endpoint: streaming_speech
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:57] Route: /v1/video/chat/stream, Endpoint: streaming_video_chat
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:57] Route: /v1/realtime, Endpoint: realtime_websocket
(APIServer pid=9684) INFO:     Started server process [9684]
(APIServer pid=9684) INFO:     Waiting for application startup.
(APIServer pid=9684) INFO:     Application startup complete.
Server ready on 127.0.0.1:34085
OmniServer started successfully

=== PRE-TEST GPU CLEANUP ===

Skipping GPU memory cleanup check (typically: instance already up; no check needed between tests)

--- Running test: test_text_to_audio_001[async_chunk]
(APIServer pid=9684) The tokenizer you are loading from '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=9684) INFO 05-05 23:25:08 [serving_speech.py:1822] TTS speech request speech-870395dad3e04f08: text='The weather is nice today, perfect for a walk in t...', model=Base
(APIServer pid=9684) INFO 05-05 23:25:08 [orchestrator.py:901] [Orchestrator] _handle_add_request: stage=0 req=speech-870395dad3e04f08 prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=1 num_sampling_params=2
(APIServer pid=9684) INFO 05-05 23:25:08 [serving_speech.py:1822] TTS speech request speech-ad4b71432ea76f84: text='The weather is nice today, perfect for a walk in t...', model=Base
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-0 adding request: speech-870395dad3e04f08
(APIServer pid=9684) INFO 05-05 23:25:08 [serving_speech.py:1822] TTS speech request speech-98c95225cab8a7a1: text='The weather is nice today, perfect for a walk in t...', model=Base
(APIServer pid=9684) INFO 05-05 23:25:08 [serving_speech.py:1822] TTS speech request speech-a4550679581b1dc5: text='The weather is nice today, perfect for a walk in t...', model=Base
(APIServer pid=9684) INFO 05-05 23:25:08 [serving_speech.py:1822] TTS speech request speech-ac4cfcbae885fe1d: text='The weather is nice today, perfect for a walk in t...', model=Base
(StageEngineCoreProc pid=9834) WARNING 05-05 23:25:08 [gpu_model_runner.py:390] additional_information on request data is deprecated, use model_intermediate_buffer
(StageEngineCoreProc pid=9834) WARNING 05-05 23:25:08 [gpu_model_runner.py:1145] additional_information on scheduled_cached_reqs is deprecated, use model_intermediate_buffer
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-1 adding request: speech-870395dad3e04f08
(APIServer pid=9684) INFO 05-05 23:25:08 [orchestrator.py:901] [Orchestrator] _handle_add_request: stage=0 req=speech-ad4b71432ea76f84 prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=1 num_sampling_params=2
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-0 adding request: speech-ad4b71432ea76f84
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-1 adding request: speech-ad4b71432ea76f84
(APIServer pid=9684) INFO 05-05 23:25:08 [orchestrator.py:901] [Orchestrator] _handle_add_request: stage=0 req=speech-98c95225cab8a7a1 prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=1 num_sampling_params=2
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-0 adding request: speech-98c95225cab8a7a1
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-1 adding request: speech-98c95225cab8a7a1
(APIServer pid=9684) INFO 05-05 23:25:08 [orchestrator.py:901] [Orchestrator] _handle_add_request: stage=0 req=speech-a4550679581b1dc5 prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=1 num_sampling_params=2
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-0 adding request: speech-a4550679581b1dc5
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-1 adding request: speech-a4550679581b1dc5
(APIServer pid=9684) INFO 05-05 23:25:08 [orchestrator.py:901] [Orchestrator] _handle_add_request: stage=0 req=speech-ac4cfcbae885fe1d prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=1 num_sampling_params=2
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-0 adding request: speech-ac4cfcbae885fe1d
(APIServer pid=9684) INFO 05-05 23:25:09 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-1 adding request: speech-ac4cfcbae885fe1d
(StageEngineCoreProc pid=9834) `torch_dtype` is deprecated! Use `dtype` instead!
(StageEngineCoreProc pid=9834) INFO 05-05 23:25:09 [configuration_qwen3_tts_tokenizer_v2.py:156] encoder_config is None. Initializing encoder with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:25:09 [configuration_qwen3_tts_tokenizer_v2.py:159] decoder_config is None. Initializing decoder with default values
(StageEngineCoreProc pid=9834) WARNING 05-05 23:25:09 [gpu_model_runner.py:1497] _merge_additional_information_update is deprecated, use _update_intermediate_buffer
(StageEngineCoreProc pid=9834) INFO 05-05 23:25:14 [qwen3_code_predictor.py:568] code_predictor: warmup done for buckets [1, 2, 4, 8, 10]
(StageEngineCoreProc pid=9834) INFO 05-05 23:25:14 [qwen3_code_predictor.py:588] code_predictor: captured CUDA graphs for buckets [1, 2, 4, 8, 10]
(StageEngineCoreProc pid=9834) INFO 05-05 23:25:14 [qwen3_code_predictor.py:536] code_predictor: torch.compile (no epilogue fusion) + CUDA graphs
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:15 [gpu_model_runner.py:390] additional_information on request data is deprecated, use model_intermediate_buffer
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:15 [gpu_model_runner.py:1145] additional_information on scheduled_cached_reqs is deprecated, use model_intermediate_buffer
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:15 [qwen3_tts_code2wav.py:288] Code2Wav codec: frames=103 q=16 uniq=1072 range=[1,2047] batch=1
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:18 [qwen3_tts_code2wav.py:260] Code2Wav input_ids length 1 not divisible by num_quantizers 16; skipping malformed request.
(APIServer pid=9684) INFO:     127.0.0.1:35138 - "POST /v1/audio/speech HTTP/1.1" 200 OK
audio data is saved: ./test_b3faaa6eeb58497ab9ca371ca23b00d4.wav
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:18 [qwen3_tts_code2wav.py:260] Code2Wav input_ids length 1 not divisible by num_quantizers 16; skipping malformed request.
(APIServer pid=9684) INFO:     127.0.0.1:35140 - "POST /v1/audio/speech HTTP/1.1" 200 OK
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:21 [qwen3_tts_code2wav.py:260] Code2Wav input_ids length 1 not divisible by num_quantizers 16; skipping malformed request.
(APIServer pid=9684) INFO:     127.0.0.1:35156 - "POST /v1/audio/speech HTTP/1.1" 200 OK
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:21 [qwen3_tts_code2wav.py:260] Code2Wav input_ids length 1 not divisible by num_quantizers 16; skipping malformed request.
(APIServer pid=9684) INFO:     127.0.0.1:35162 - "POST /v1/audio/speech HTTP/1.1" 200 OK
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:22 [qwen3_tts_code2wav.py:260] Code2Wav input_ids length 1 not divisible by num_quantizers 16; skipping malformed request.
(APIServer pid=9684) INFO:     127.0.0.1:35170 - "POST /v1/audio/speech HTTP/1.1" 200 OK
WARNING 05-05 23:25:27 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:25:27 [nixl_utils.py:44] NIXL agent config is not available
the avg e2e latency is: 16.661387036787346
audio content is: The weather is nice today. Perfect for a walk in the park.
input text is: The weather is nice today, perfect for a walk in the park.
cosine similarity text1 is: the weather is nice today perfect for a walk in the park, text2 is: the weather is nice today perfect for a walk in the park
Cosine similarity: 1.000
config.json: 2.65kB [00:00, 7.32MB/s]
model.safetensors: 100%|██████████████████████████████████████████████████████████████████████████| 378M/378M [03:23<00:00, 1.86MB/s]
preprocessor_config.json: 100%|█████████████████████████████████████████████████████████████████████| 215/215 [00:00<00:00, 1.41MB/s]
Device set to use cpu
gender classifier: label=женский, conf=0.984, gender=female, median_f0=240.6Hz
Preset voice gender check: preset='clone', estimated='female', expected='female'
audio data is saved: ./test_b25c3b3d7f754866a4d97ca7e77a35ab.wav
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:29:12 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:29:12 [nixl_utils.py:44] NIXL agent config is not available
the avg e2e latency is: 17.73810344398953
audio content is: The weather is nice today, perfect for a walk in the park.
input text is: The weather is nice today, perfect for a walk in the park.
cosine similarity text1 is: the weather is nice today perfect for a walk in the park, text2 is: the weather is nice today perfect for a walk in the park
Cosine similarity: 1.000
gender classifier: label=женский, conf=0.986, gender=female, median_f0=250.0Hz
Preset voice gender check: preset='clone', estimated='female', expected='female'
audio data is saved: ./test_5060caf337fa40ccbade0dcc9f0cd82a.wav
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:29:29 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:29:29 [nixl_utils.py:44] NIXL agent config is not available
the avg e2e latency is: 16.998421740951017
audio content is: The weather is nice today, perfect for a walk in the park.
input text is: The weather is nice today, perfect for a walk in the park.
cosine similarity text1 is: the weather is nice today perfect for a walk in the park, text2 is: the weather is nice today perfect for a walk in the park
Cosine similarity: 1.000
gender classifier: label=женский, conf=0.990, gender=female, median_f0=262.3Hz
Preset voice gender check: preset='clone', estimated='female', expected='female'
audio data is saved: ./test_40c98cf4facc46d6b10cfa93c31a0cad.wav
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:29:46 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:29:46 [nixl_utils.py:44] NIXL agent config is not available
the avg e2e latency is: 16.983736565103754
audio content is: The weather is nice today perfect for a walk in the park
input text is: The weather is nice today, perfect for a walk in the park.
cosine similarity text1 is: the weather is nice today perfect for a walk in the park, text2 is: the weather is nice today perfect for a walk in the park
Cosine similarity: 1.000
gender classifier: label=женский, conf=0.978, gender=female, median_f0=254.0Hz
Preset voice gender check: preset='clone', estimated='female', expected='female'
audio data is saved: ./test_a939a4ef782c429f9e2b5f8c4c514a93.wav
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:30:04 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:30:04 [nixl_utils.py:44] NIXL agent config is not available
the avg e2e latency is: 17.315035382052884
audio content is: The weather is nice today. Perfect for a walk in the park.
input text is: The weather is nice today, perfect for a walk in the park.
cosine similarity text1 is: the weather is nice today perfect for a walk in the park, text2 is: the weather is nice today perfect for a walk in the park
Cosine similarity: 1.000
gender classifier: label=женский, conf=0.976, gender=female, median_f0=223.8Hz
Preset voice gender check: preset='clone', estimated='female', expected='female'
.
Skipping GPU memory cleanup check (typically: instance already up; no check needed between tests)

OmniServer stopping...
(StageEngineCoreProc pid=10102) INFO 05-05 23:30:12 [core.py:1238] Shutdown initiated (timeout=0)
(StageEngineCoreProc pid=9834) INFO 05-05 23:30:12 [core.py:1238] Shutdown initiated (timeout=0)
(StageEngineCoreProc pid=10102) INFO 05-05 23:30:12 [core.py:1261] Shutdown complete
(StageEngineCoreProc pid=9834) INFO 05-05 23:30:12 [core.py:1261] Shutdown complete
[rank0]:[W505 23:30:13.533936353 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W505 23:30:13.815573753 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=9684) ERROR 05-05 23:30:14 [stage_engine_core_client.py:201] [StageEngineCoreClient] Stage-1 subprocess died unexpectedly (exit code None).
(APIServer pid=9684) ERROR 05-05 23:30:14 [stage_engine_core_client.py:201] [StageEngineCoreClient] Stage-0 subprocess died unexpectedly (exit code None).
(APIServer pid=9684) INFO 05-05 23:30:22 [omni_base.py:456] [AsyncOmni] Shutting down
(APIServer pid=9684) INFO 05-05 23:30:22 [async_omni_engine.py:1785] [AsyncOmniEngine] Shutting down Orchestrator
(APIServer pid=9684) INFO 05-05 23:30:22 [orchestrator.py:252] [Orchestrator] Received shutdown signal
(APIServer pid=9684) INFO 05-05 23:30:22 [orchestrator.py:1221] [Orchestrator] Shutting down all stages
(APIServer pid=9684) INFO 05-05 23:30:22 [orchestrator.py:1225] [Orchestrator] Stage 0 shut down
(APIServer pid=9684) INFO:     Shutting down
(APIServer pid=9684) INFO 05-05 23:30:22 [orchestrator.py:1225] [Orchestrator] Stage 1 shut down
(APIServer pid=9684) INFO 05-05 23:30:22 [launcher.py:137] Shutting down FastAPI HTTP server.
(APIServer pid=9684) INFO 05-05 23:30:22 [omni_base.py:456] [AsyncOmni] Shutting down
(APIServer pid=9684) INFO:     Shutting down
(APIServer pid=9684) INFO:     Waiting for application shutdown.
(APIServer pid=9684) INFO:     Application shutdown complete.
Pre-test GPU status:
[GPU Memory Monitor] Waiting for GPU 0 to free memory, Condition: Memory usage ratio ≤ 5.0%
[GPU Memory Status] Current usage:
  GPU 0: 0.5GiB/32.0GiB (1.6%)
[GPU Memory Freed] Devices 0 meet memory condition
   Condition: Memory usage ratio ≤ 5.0%
   Wait time: 0.0 seconds (0.0 minutes)
Post-test GPU status:

================================================================================
NVIDIA GPU Information (nvidia-smi)
================================================================================
Tue May  5 23:30:24 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        On  |   00000000:4F:00.0 Off |                  N/A |
| 30%   33C    P8             13W /  320W |       1MiB /  32760MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

================================================================================
Detailed GPU Processes (nvidia-smi pmon)
================================================================================
# gpu         pid   type     sm    mem    enc    dec    jpg    ofa    command 
# Idx           #    C/G      %      %      %      %      %      %    name 
    0          -     -      -      -      -      -      -      -    -              


================================================================================
System Processes with GPU keywords
================================================================================
OmniServer stopped


========================================================= warnings summary ==========================================================
.venv/lib/python3.12/site-packages/_pytest/config/__init__.py:1434
  /root/vllm-omni/.venv/lib/python3.12/site-packages/_pytest/config/__init__.py:1434: PytestConfigWarning: Unknown config option: asyncio_mode
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

vllm_omni/version.py:55
  /root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
   --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
   --> vLLM version 0.20.0
  This will likely cause compatibility issues.
    warn_if_misaligned_vllm_version()

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

.venv/lib/python3.12/site-packages/torch/jit/_script.py:365: 14 warnings
  /root/vllm-omni/.venv/lib/python3.12/site-packages/torch/jit/_script.py:365: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
--- Running Summary
============================================ 1 passed, 18 warnings in 392.15s (0:06:32) =============================================
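
The audio-text consistency lines in the log above compare a transcript with the input text after lowercasing and stripping punctuation. A minimal sketch of that kind of check follows. It is an illustration, not the test's actual implementation: the helper names `normalize` and `cosine_similarity` are hypothetical, and the bag-of-words vectors are an assumption (the real test may use an embedding model instead).

```python
import math
import re
from collections import Counter


def normalize(text):
    """Lowercase and drop punctuation, keeping word characters and single spaces."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(cleaned.split())


def cosine_similarity(text1, text2):
    """Cosine similarity between word-count vectors of two strings (assumed metric)."""
    c1, c2 = Counter(text1.split()), Counter(text2.split())
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Strings taken from the log: the transcript differs from the prompt only in punctuation.
transcript = "The weather is nice today. Perfect for a walk in the park."
prompt = "The weather is nice today, perfect for a walk in the park."

score = cosine_similarity(normalize(transcript), normalize(prompt))
print(f"Cosine similarity: {score:.3f}")
```

Normalizing first is what lets the two differently punctuated transcripts in the log score the same against the input text.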

linyueqian · 2026-05-05T21:55:11Z

please fix ci

xiaohajiayou · 2026-05-06T01:45:44Z

> please fix ci

- The failure in Omni · Function Test with H100 appears to be caused by [Bug] [CI failure]: CI audio-text consistency assertion fails intermittently #3341.
- The failure in the Diffusion Model CPU offloading Test was likely introduced by [Feat]add cpu-offload/layerwise-offload for stable-audio-open & fix output inconsistency with same seed #2909. It appears that the merge-test CI check was not run when that PR was merged.
- At commit beaf1c9, the Diffusion Model CPU offloading Test was still passing. It started failing after this branch was synced with main:
```
FAILED tests/e2e/offline_inference/test_diffusion_layerwise_offload.py::test_layerwise_offload_diffusion_model[stabilityai/stable-audio-open-1.0] 
- AssertionError: Audio outputs differ beyond tolerance atol=0.001
```
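
For context on the assertion message above, an absolute-tolerance comparison can be sketched as follows. This is a minimal illustration, not the actual test code: the helper names `max_abs_diff` and `within_atol`, the sample values, and the plain-list representation are all assumptions (the real test presumably compares audio tensors). The 1e-2 value mirrors the tolerance named in the title of #3371.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def within_atol(a, b, atol=1e-3):
    """True if every pair of samples differs by at most `atol`."""
    return max_abs_diff(a, b) <= atol


# Hypothetical audio samples from two runs with the same seed.
baseline = [0.100, -0.250, 0.500]
offloaded = [0.100, -0.246, 0.500]

# One sample differs by about 0.004, so the strict check fails and the relaxed one passes.
print(within_atol(baseline, offloaded, atol=1e-3))
print(within_atol(baseline, offloaded, atol=1e-2))
```

A check like this is sensitive to small numerical drift, which is why a tolerance change alone can flip a test between passing and failing without any functional regression.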

linyueqian · 2026-05-06T04:05:31Z

@yenuo26 @hsliuustc0106 I think this PR is good to merge. Please take another look.

gcanlin · 2026-05-06T04:47:06Z

@amy-why-3459 https://buildkite.com/vllm/vllm-omni/builds/8939/canvas?sid=019df8f9-159d-43d0-8b59-530fe5f1dcdc&tab=output, please check the performance results. There does not appear to be any regression.

hsliuustc0106 · 2026-05-06T04:57:57Z

> @yenuo26 @hsliuustc0106 I think this PR is good to merge. Please take another look.

I will merge it after CI passes.

amy-why-3459 · 2026-05-06T05:08:21Z

> @amy-why-3459 https://buildkite.com/vllm/vllm-omni/builds/8939/canvas?sid=019df8f9-159d-43d0-8b59-530fe5f1dcdc&tab=output, please check the performance results. There does not appear to be any regression.

LGTM

Signed-off-by: xiaohajiayou <923390377@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

xiaohajiayou requested a review from hsliuustc0106 as a code owner May 1, 2026 09:37

xiaohajiayou force-pushed the whitelist-optimization-v2 branch from 230ece7 to 6478b2c on May 1, 2026 09:39

Merge branch 'main' into whitelist-optimization-v2

cf65678

lishunyang12 added the merge-test label (label to trigger buildkite merge test CI) May 2, 2026

Fix mimo audio async chunk None handling

da464af

Signed-off-by: xiaohajiayou <923390377@qq.com>

xiaohajiayou force-pushed the whitelist-optimization-v2 branch from 8a461ab to da464af on May 2, 2026 08:37

Merge branch 'main' into whitelist-optimization-v2

a9a09c1

xiaohajiayou mentioned this pull request May 2, 2026

[Bug]: Voxtral-4B-TTS-2603 fails to start unless --skip-mm-profiling is explicitly set #3308

Closed

1 task

hsliuustc0106 reviewed May 2, 2026

Comment thread tests/e2e/online_serving/test_mimo_audio.py

Comment thread vllm_omni/engine/async_omni_engine.py Outdated

Comment thread vllm_omni/config/stage_config.py

xiaohajiayou force-pushed the whitelist-optimization-v2 branch from 2e08140 to 235a452 on May 2, 2026 14:01

Restore deploy runtime defaults for migrated models

3675d56

Signed-off-by: xiaohajiayou <923390377@qq.com>

xiaohajiayou force-pushed the whitelist-optimization-v2 branch from d73a938 to 3675d56 on May 2, 2026 14:07

Merge branch 'main' into whitelist-optimization-v2

072b5b7

hsliuustc0106 requested changes May 2, 2026

xiaohajiayou force-pushed the whitelist-optimization-v2 branch from a3117dd to 074b866 on May 2, 2026 16:25

Preserve deploy defaults for migrated configs

5604953

Signed-off-by: xiaohajiayou <923390377@qq.com>

xiaohajiayou force-pushed the whitelist-optimization-v2 branch from 074b866 to 5604953 on May 2, 2026 16:26

hsliuustc0106 added the ready label (label to trigger buildkite CI) May 2, 2026

xiaohajiayou mentioned this pull request May 2, 2026

[Followup] Deploy YAML field ownership: stage-level defaults, user knobs, and model-owned config #3313

Open

hsliuustc0106 reviewed May 2, 2026

xiaohajiayou added 2 commits May 3, 2026 01:18

Merge branch 'main' into whitelist-optimization-v2

8d39bd7

hsliuustc0106 and others added 2 commits May 3, 2026 10:45

Merge branch 'main' into whitelist-optimization-v2

33736c6

Add compilation config to deploy stage schema

3702299

Signed-off-by: xiaohajiayou <923390377@qq.com>

Merge branch 'main' into whitelist-optimization-v2

0b65ea5

lishunyang12 and others added 4 commits May 4, 2026 22:23

Merge branch 'main' into whitelist-optimization-v2

f065b2e

Merge branch 'main' into whitelist-optimization-v2

6ebf801

Merge branch 'main' into whitelist-optimization-v2

7894e05

Merge branch 'main' into whitelist-optimization-v2

beaf1c9

hsliuustc0106 removed the omni-test label (label to trigger buildkite omni model test in nightly CI) May 5, 2026

amy-why-3459 reviewed May 5, 2026

Comment thread vllm_omni/deploy/qwen3_omni_moe.yaml

Merge branch 'main' into whitelist-optimization-v2

065034f

Signed-off-by: xiaohajiayou <923390377@qq.com>

xiaohajiayou force-pushed the whitelist-optimization-v2 branch from 31dcc5d to 065034f on May 5, 2026 15:08

Merge branch 'main' into whitelist-optimization-v2

b169da6

lishunyang12 added the omni-test label (label to trigger buildkite omni model test in nightly CI) May 5, 2026

Merge branch 'main' into whitelist-optimization-v2

88daf61

linyueqian mentioned this pull request May 6, 2026

[CI][Bugfix] Relax stable-audio layerwise offload determinism tolerance to 1e-2 #3371

Merged

Merge branch 'main' into whitelist-optimization-v2

3a41163

linyueqian added the tts-test (label to trigger buildkite tts models test in nightly CI) and ready (label to trigger buildkite CI) labels May 6, 2026

gcanlin approved these changes May 6, 2026

amy-why-3459 approved these changes May 6, 2026

hsliuustc0106 merged commit b076006 into vllm-project:main May 6, 2026
8 checks passed

8 participants
